Are certain types of properties getting a free ride on their taxes? 
Identify tranches of property suspected to follow this pattern (Vacant, Auto-Oriented, etc.)
Match these patterns to recorded sales
Aggregate sales data by block group
Vacants
Analysis was limited to the last five years
Analysis limited by data quality
Aggregation switched from census tracts to neighborhoods
Baltimore’s real property dataset provides additional attributes not available at the state level, such as neighborhood and whether the property is unimproved
All vacant building notices issued in Baltimore back to the 1970s
This dataset records the attributes of every piece of real property in Maryland, including location, assessment values, and some building characteristics.
Building Characteristics
Core
Combination of exploratory analysis and visualization
Relying heavily on dplyr and sf libraries for analysis
rdeck and tmap used for visualization
sales_valid_date |>
left_join(dates, by = join_by(property == BLOCKLOT)) |>
mutate(
vacant_at_sale = (date >= DateNotice) & ((is.na(date_terminate) | date <= date_terminate))
) |>
group_by(property, sale) |>
summarise(
# Summarise the flag with any() over the whole group; across(everything(), first)
# would overwrite vacant_at_sale with its first value before any() could see it
across(-vacant_at_sale, first),
vacant_at_sale = any(vacant_at_sale, na.rm = TRUE)
) |>
ungroup() -> vacant_sale

I was scratching my head trying to figure this out and asked ChatGPT [@ChatGPT] for help on how to approach it, which resulted in this elegant method for determining when a sale involved a vacant property.
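The vacancy test above boils down to an interval check: a sale counts as a vacant sale if it falls on or after the notice date and the notice was still open (no termination recorded, or the sale preceded it). A minimal base-R sketch with made-up dates; the variable names here are illustrative, not the project's actual columns:

```r
# Hypothetical sale and vacant-building-notice dates (illustrative only)
sale_date      <- as.Date(c("2019-06-01", "2015-01-15", "2021-03-10"))
notice_date    <- as.Date(c("2018-02-01", "2016-07-01", "2020-05-20"))
terminate_date <- as.Date(c(NA,           NA,           "2020-12-31"))

# Vacant at sale: the sale is on/after the notice, and the notice is still open
# (no termination recorded, or the sale happened before termination)
vacant_at_sale <- (sale_date >= notice_date) &
  (is.na(terminate_date) | sale_date <= terminate_date)

vacant_at_sale
# First sale: open notice -> TRUE; second: sale predates the notice -> FALSE;
# third: notice terminated before the sale -> FALSE
```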
left_join(sales, property_land_use, by = join_by(property == BLOCKLOT)) |>
group_by(property, transfer_no) |>
summarise(date = first(date),
price = first(price),
block = first(block.x),
property = first(property),
acct_id_full = first(acct_id_full.x),
vacant_at_sale = first(vacant_at_sale),
Land_Value = first(Land_Value),
Improvement_Value = first(Improvement_Value),
Total_Assessment = first(Total_Assessment),
NEIGHBOR = first(NEIGHBOR),
BL_DSCTYPE = first(BL_DSCTYPE),
BL_DSCSTYL = first(BL_DSCSTYL),
CM_DSCIUSE = first(CM_DSCIUSE),
NO_IMPRV = first(NO_IMPRV),
.groups = "keep") |>
mutate(identifier = case_when(
(NO_IMPRV == "Y" & is.na(BL_DSCTYPE)) ~ "unimproved",
vacant_at_sale ~ "vacant",
str_detect(BL_DSCTYPE, "AUTO|WAREHOUSE") ~ "unperforming",
.default = "regular"),
price_ratio = Total_Assessment / price) |>
ungroup() -> all_sales

all_sales |>
group_by(NEIGHBOR, identifier) |>
summarise(med_price_ratio = median(price_ratio),
mean_price_ratio = mean(price_ratio),
med_price = median(price),
mean_price = mean(price),
n = n(),
.groups = "keep") |>
pivot_wider(id_cols = "NEIGHBOR",
names_from = "identifier",
names_glue = "{identifier}_{.value}",
values_from = c("med_price_ratio",
"mean_price_ratio",
"med_price",
"mean_price",
"n")) %>%
left_join(neighborhoods, ., by = join_by(Name == NEIGHBOR)) |>
mutate(pct_blk = Blk_AfAm / Population,
pct_wht = White / Population) |>
select(Name,
Population,
pct_blk,
pct_wht,
starts_with(c("vacant",
"unimproved",
"unperforming",
"regular"))) -> neighborhood_stats

The correlation shows a statistically significant (p-value = 0.00139) positive relationship between the median sale price and the median assessment-to-sale-price ratio.
[1] 0.2137546
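A coefficient and p-value like those above come from a Pearson correlation test, which base R's cor.test provides. A sketch with synthetic stand-in columns, since the real values live in neighborhood_stats:

```r
# Synthetic stand-ins for median price and median assessment-to-sale ratio;
# the slope and noise here are invented purely to produce a positive relationship
set.seed(42)
med_price       <- rlnorm(100, meanlog = 11, sdlog = 0.5)
med_price_ratio <- 0.5 + 0.000005 * med_price + rnorm(100, sd = 0.2)

# Pearson correlation with a significance test
result <- cor.test(med_price, med_price_ratio)
result$estimate  # correlation coefficient
result$p.value   # significance of the relationship
```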
Contrary to my expectations, there is no clear pattern here.
This map shows neighborhoods with at least 25 sales of vacant homes. Based on this, we can see once again that at this level the assessments subvert the hypothesis: price ratios for vacant sales are higher than for regular sales.
Baltimore City and Maryland State data use separate keys (BLOCKLOT and Account ID)
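One way to bridge two datasets with incompatible keys is a small crosswalk table that maps one key to the other, then a pair of joins. The IDs and columns below are entirely made up to illustrate the pattern; the real BLOCKLOT and Account ID formats should be checked against both datasets:

```r
# Illustrative only: city records keyed by BLOCKLOT, state records keyed by
# ACCTID, and a crosswalk connecting the two (all values fabricated)
city  <- data.frame(BLOCKLOT = c("0001 001", "0001 002"),
                    neighborhood = c("A", "B"))
xwalk <- data.frame(BLOCKLOT = c("0001 001", "0001 002"),
                    ACCTID   = c("03-01-0001-001", "03-01-0001-002"))
state <- data.frame(ACCTID = c("03-01-0001-001", "03-01-0001-002"),
                    assessment = c(50000, 75000))

# Attach the state key to city records, then pull in state attributes
merged <- merge(merge(city, xwalk, by = "BLOCKLOT"), state, by = "ACCTID")
merged[, c("BLOCKLOT", "neighborhood", "assessment")]
```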
I frequently found myself backtracking to add attributes to various intermediate datasets
Getting to the end and realizing the data isn’t clean enough to support conclusions
I think my analysis does a good job of taking a very large dataset and distilling it into a more usable format. I aimed to make the data tidier so that comparisons would be easier to create.
All data, code, and an explanation of the analysis and its inspirations are available through the README in my project repository.